# README file for *From Keywords to Clusters: AI-Driven Analysis of YouTube Comments to Reveal Election Issue Salience in 2024*

*Raisa M. Simoes, Ph.D; Timoteo Kelly, M.S; Eduardo J. Simoes, MD, MSc, MPH; Praveen Rao, Ph.D*

## Overview

This repository contains data, code, and documentation for partial replication of the analyses presented in our study of YouTube comments, focusing on discourse around five key topics: **Immigration, Inflation, Public Health, Identity Politics, and Democracy**.

The materials here include processed datasets, replication code for generating select figures, and the necessary environment specifications to reproduce portions of the analysis. **The full research project code used for the complete analysis pipeline is available upon request.**

## Repository Contents

### 1. Code

**`figures_code.ipynb`**  
Jupyter Notebook containing the replication code used to generate Figures 2 and 3. The notebook uses the processed datasets to recreate the visualizations of topic frequencies across the analyzed comments.

### 2. Data

**`preliminary_dataset.csv`**  
Dataset of YouTube comments obtained after scraping and initial preprocessing.

**`relevantcomments_dataset.tab`**  
A filtered subset of the full dataset containing only the comments classified into the five target categories (Immigration, Inflation, Public Health, Identity Politics, Democracy). Contains 11 variables and 1,377 observations.

**`post_analysis_dataset.xlsx`**  
Dataset after running additional analyses and transformations, ready for final model fitting and visualization.

**`topic_frequency.csv`**  
Aggregated topic frequency data used to generate Figures 2 and 3.

### 3. Documentation

**`README.md`**  
This replication guide.

### 4. Dependencies

**`requirements.txt`**  
List of Python package dependencies required to run the replication code. Please use this file to set up your Python environment.

## Replication Instructions

Install the required Python packages listed in `requirements.txt`, for example with `pip install -r requirements.txt`. Using a virtual environment is recommended.
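After installation, a quick check can confirm that the listed packages are present in the active environment. The following is a minimal sketch, not part of the replication package: the helper names `requirement_names` and `missing_packages` are hypothetical, and the parsing assumes simple `name==version` style lines rather than the full requirements syntax.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def requirement_names(lines):
    """Extract package names from requirements-style lines.

    Assumption: each requirement starts with the package name; blank
    lines, comments, and pip option lines (starting with "-") are skipped.
    """
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    try:
        with open("requirements.txt", encoding="utf-8") as handle:
            print("Missing:", missing_packages(requirement_names(handle)))
    except FileNotFoundError:
        print("requirements.txt not found in the current directory")
```

An empty list means every listed package was found. Version constraints are not checked by this sketch.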

Open and run `figures_code.ipynb` in Jupyter Notebook or Jupyter Lab. The notebook will read the processed data files and generate Figures 2 and 3 as shown in the original analysis.

Before running the notebook, make sure all data files are located in the same directory as the notebook, or adjust the file paths in the notebook accordingly.
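The file placement can be checked before opening the notebook. The sketch below uses only the standard library and is not part of the replication package: the helper names `missing_files` and `table_shape` are hypothetical, and it assumes that `relevantcomments_dataset.tab` is UTF-8, tab-delimited text with a single header row.

```python
import csv
from pathlib import Path

# File names as listed under "Repository Contents".
DATA_FILES = [
    "preliminary_dataset.csv",
    "relevantcomments_dataset.tab",
    "post_analysis_dataset.xlsx",
    "topic_frequency.csv",
]


def missing_files(data_dir="."):
    """Return the expected data files that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in DATA_FILES if not (base / name).is_file()]


def table_shape(path, delimiter="\t"):
    """Return (n_observations, n_variables) for a delimited text file.

    Assumption: the first row is a header and the file is UTF-8 encoded.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])
        n_rows = sum(1 for row in reader if row)
    return n_rows, len(header)


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        # This README reports 11 variables and 1377 observations for this file.
        print(table_shape("relevantcomments_dataset.tab"))
```

If the reported shape differs from the counts given in this README, the delimiter or encoding assumption is the first thing to check.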

## Notes

The file `preliminary_dataset.csv` is provided for transparency regarding data collection and preprocessing. However, the primary analyses and figures are based on the filtered and processed datasets (`relevantcomments_dataset.tab` and `post_analysis_dataset.xlsx`).

MD5 hashes are provided for file integrity verification.
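A downloaded file can be compared against its published hash with Python's standard `hashlib` module. This is a minimal sketch: the helper names are hypothetical, and the expected hash must be copied from wherever the authors publish it, since no hash values are listed in this README.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_md5):
    """Compare a file's MD5 digest with a published value.

    Surrounding whitespace and letter case in expected_md5 are ignored.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_published_hash("topic_frequency.csv", expected)` returns `True` only when the local copy is byte-for-byte identical to the file the hash was computed from.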

The code provided in this package supports replication of specific figures only. **The full codebase used for the complete research project, including data collection, cleaning, classification, and modeling, is available upon request.** Please contact the authors for access.

## Contact

For questions, replication support, or access to the full project code, please contact the corresponding author: Raisa M. Simoes, simoes.raisa@gmail.com.
